Speculative Approximations for Terascale Analytics
Model calibration is a major challenge faced by the plethora of statistical
analytics packages that are increasingly used in Big Data applications.
Identifying the optimal model parameters is a time-consuming process that has
to be executed from scratch for every dataset/model combination even by
experienced data scientists. We argue that the inability to evaluate multiple
parameter configurations simultaneously and the lack of support to quickly
identify sub-optimal configurations are the principal causes. In this paper, we
develop two database-inspired techniques for efficient model calibration.
Speculative parameter testing applies advanced parallel multi-query processing
methods to evaluate several configurations concurrently. The number of
configurations is determined adaptively at runtime, while the configurations
themselves are extracted from a distribution that is continuously learned
following a Bayesian process. Online aggregation is applied to identify
sub-optimal configurations early in the processing by incrementally sampling
the training dataset and estimating the objective function corresponding to
each configuration. We design concurrent online aggregation estimators and
define halting conditions to accurately and timely stop the execution. We apply
the proposed techniques to distributed gradient descent optimization -- batch
and incremental -- for support vector machines and logistic regression models.
We implement the resulting solutions in GLADE PF-OLA -- a state-of-the-art Big
Data analytics system -- and evaluate their performance over terascale-size
synthetic and real datasets. The results confirm that as many as 32
configurations can be evaluated concurrently almost as fast as one, while
sub-optimal configurations are detected accurately in a fraction of the full
execution time.
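The combination of speculative testing and online aggregation described above can be sketched in a few lines. The following is a hypothetical single-machine illustration, not GLADE PF-OLA code: candidate configurations are scored on incrementally growing samples, and a configuration is halted as soon as its confidence interval lies entirely above the current best upper bound. The objective `estimate_loss` and the z-based bounds are illustrative assumptions.

```python
import random
import statistics

# Hypothetical single-machine sketch (not GLADE PF-OLA code) of the
# speculative testing + online aggregation idea: every candidate
# configuration is scored on incrementally growing samples, and a
# configuration is halted as soon as its confidence interval lies
# entirely above the current best upper bound.

def estimate_loss(config, sample):
    # Illustrative objective: mean absolute deviation from `config`.
    return statistics.mean(abs(x - config) for x in sample)

def speculative_calibration(configs, data, batch=100, z=2.0):
    alive = {c: [] for c in configs}          # per-config loss estimates
    for start in range(0, len(data), batch):
        chunk = data[start:start + batch]
        for c in alive:
            alive[c].append(estimate_loss(c, chunk))
        bounds = {}
        for c, losses in alive.items():
            m = statistics.mean(losses)
            if len(losses) > 1:
                half = z * statistics.stdev(losses) / len(losses) ** 0.5
            else:
                half = float("inf")           # no interval yet
            bounds[c] = (m - half, m + half)
        best_upper = min(hi for _, hi in bounds.values())
        # Halting condition: drop provably sub-optimal configurations.
        alive = {c: v for c, v in alive.items() if bounds[c][0] <= best_upper}
        if len(alive) == 1:
            break
    return min(alive, key=lambda c: statistics.mean(alive[c]))

random.seed(7)
data = [random.gauss(3.0, 1.0) for _ in range(10_000)]
best = speculative_calibration([0.0, 1.0, 3.0, 8.0], data)
```

In this toy run the data are centered at 3.0, so the configuration 3.0 minimizes the objective and the other three are halted after only a couple of sample batches.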
PF-OLA: A High-Performance Framework for Parallel On-Line Aggregation
Online aggregation provides estimates to the final result of a computation
during the actual processing. The user can stop the computation as soon as the
estimate is accurate enough, typically early in the execution. This allows for
the interactive data exploration of the largest datasets. In this paper we
introduce the first framework for parallel online aggregation in which the
estimation virtually does not incur any overhead on top of the actual
execution. We define a generic interface to express any estimation model that
abstracts completely the execution details. We design a novel estimator
specifically targeted at parallel online aggregation. When executed by the
framework over a massive TPC-H instance, the estimator provides
accurate confidence bounds early in the execution even when the cardinality of
the final result is seven orders of magnitude smaller than the dataset size and
without incurring overhead.
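As a rough illustration of sampling-based online aggregation (not PF-OLA's actual estimator), the sketch below scales a running sample mean up to a SUM estimate and attaches a CLT-style confidence bound with a finite-population correction, so the bound shrinks to zero as the scan completes. All constants are made up.

```python
import math
import random

# A rough single-node sketch of sampling-based online aggregation for a
# SUM query (not PF-OLA's actual estimator). Rows are consumed in random
# order; the running sample mean is scaled up to estimate the final SUM,
# with a CLT-style confidence bound and a finite-population correction
# that shrinks to zero as the scan completes.

def online_sum(rows, report_every=1000, z=1.96):
    n = len(rows)
    random.shuffle(rows)          # random order = sampling w/o replacement
    total = total_sq = 0.0
    reports = []
    for k, v in enumerate(rows, 1):
        total += v
        total_sq += v * v
        if k % report_every == 0:
            mean = total / k
            var = max(total_sq / k - mean * mean, 0.0)
            estimate = n * mean   # scale sample mean to the full SUM
            half_width = z * n * math.sqrt(var / k * (1 - k / n))
            reports.append((k, estimate, half_width))
    return reports

random.seed(1)
rows = [random.uniform(0, 10) for _ in range(50_000)]
reports = online_sum(rows)
final_k, final_estimate, final_half = reports[-1]
```

Early reports already bracket the final result, and at the end of the scan the correction term zeroes the bound and the estimate coincides with the exact SUM.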
Parallel Online Aggregation in Action
Online aggregation provides continuous estimates to the final result of a computation during the actual processing. The user can stop the computation as soon as the estimate is accurate enough, typically early in the execution, or can let the processing terminate and obtain the exact result. In this demonstration, we introduce a general framework for parallel online aggregation in which estimation does not incur overhead on top of the actual processing. We define a generic interface to express any estimation model that abstracts completely the execution details. We design multiple sampling-based estimators suited for parallel online aggregation and implement them inside the framework. Demonstration participants are shown how estimates to general SQL aggregation queries over terabytes of TPC-H data are generated during the entire processing. Due to parallel execution, the estimate converges to the correct result in a matter of seconds even for the most difficult queries. The behavior of the estimators is evaluated under different operating regimes of the distributed cluster used in the demonstration.
Metformin Uniquely Prevents Thrombosis by Inhibiting Platelet Activation and mtDNA Release
Thrombosis and its complications are the leading cause of death in patients with diabetes. Metformin, a first-line therapy for type 2 diabetes, is the only drug demonstrated to reduce cardiovascular complications in diabetic patients. However, whether metformin can effectively prevent thrombosis, and its potential mechanism of action, remain unknown. Here we show that metformin prevents both venous and arterial thrombosis, with no significant prolongation of bleeding time, by inhibiting platelet activation and extracellular mitochondrial DNA (mtDNA) release. Specifically, metformin inhibits mitochondrial complex I and thereby protects mitochondrial function, reduces activated platelet-induced mitochondrial hyperpolarization, reactive oxygen species overload, and associated membrane damage. In assays designed to detect extracellular mtDNA, we found that metformin prevents mtDNA release. This study also demonstrates that mtDNA induces platelet activation through a DC-SIGN-dependent pathway. Metformin thus exemplifies a promising new class of antiplatelet agents that inhibit platelet activation by decreasing the release of free mtDNA. This study establishes a novel therapeutic strategy and molecular target for thrombotic diseases, especially the thrombotic complications of diabetes mellitus.
Active removal of waste dye pollutants using Ta3N5/W18O49 nanocomposite fibres
A scalable solvothermal technique is reported for the synthesis of a photocatalytic composite material consisting of orthorhombic Ta3N5 nanoparticles and sub-stoichiometric tungsten oxide (WO3-x) nanowires. Through X-ray diffraction and X-ray photoelectron spectroscopy, the as-grown tungsten(VI) sub-oxide was identified as monoclinic W18O49. The composite material catalysed the degradation of Rhodamine B at over double the rate of the Ta3N5 nanoparticles alone under illumination by white light, and continued to exhibit superior catalytic properties following recycling of the catalysts. Moreover, strong molecular adsorption of the dye to the W18O49 component of the composite resulted in near-complete decolourisation of the solution prior to light exposure. The radical species involved within the photocatalytic mechanisms were also explored through use of scavenger reagents. Our research demonstrates the exciting potential of this novel photocatalyst for the degradation of organic contaminants; to the authors' knowledge the material has not been investigated previously. In addition, the simplicity of the synthesis process indicates that the material is a viable candidate for the scale-up and removal of dye pollutants on a wider scale.
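The rate comparison above is usually quantified with pseudo-first-order kinetics, where C(t) = C0 * exp(-k t) and the rate constant k is the slope of ln(C0/C) versus time. The sketch below uses synthetic concentrations, not the paper's measurements; "over double the rate" corresponds to k_composite > 2 * k_Ta3N5.

```python
import math

# Pseudo-first-order kinetics commonly used for dye photodegradation:
# C(t) = C0 * exp(-k t), so k is the slope of ln(C0/C) vs t.
# The concentration series below are synthetic, for illustration only.

def rate_constant(times, concentrations):
    """Least-squares slope of ln(C0/C) versus t through the origin."""
    c0 = concentrations[0]
    ys = [math.log(c0 / c) for c in concentrations]
    return sum(t * y for t, y in zip(times, ys)) / sum(t * t for t in times)

times = [0, 10, 20, 30, 40]                         # minutes (synthetic)
c_composite = [math.exp(-0.05 * t) for t in times]  # faster decay
c_ta3n5 = [math.exp(-0.02 * t) for t in times]      # slower decay
k_composite = rate_constant(times, c_composite)
k_ta3n5 = rate_constant(times, c_ta3n5)
```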
GLADE-ML: A Database For Big Data Analytics
Big Data analytics has been a hot topic in computing systems, and various systems have emerged to better support it. Though databases have been the data hub for decades, they fall short of Big Data analytics due to inherent limitations. This dissertation presents GLADE-ML, a scalable and efficient parallel database specifically tailored for Big Data analytics. Different from traditional databases, GLADE-ML provides iteration management and explicit or implicit randomization in the execution strategy. GLADE-ML provides in-database analytics that outperforms other in-database analytics solutions by several orders of magnitude. GLADE-ML also introduces the dot-product join operator, which is specifically designed for Big Models. Big Data analytics has been approached exclusively from a data-parallel perspective, where data are partitioned to multiple workers -- threads or separate servers -- and model training is executed concurrently over different partitions, under various synchronization schemes that guarantee speedup and/or convergence. The dual -- Big Model -- problem that, surprisingly, has received no attention in database analytics, is how to manage models with millions if not billions of parameters that do not fit in memory. This distinction in model representation changes fundamentally how in-database analytics tasks are carried out. GLADE-ML supports model parallelism over massive models that cannot fit in memory. It extends the lock-free HOGWILD!-family of algorithms to disk-resident models by vertically partitioning the model offline and asynchronously updating the resulting partitions online. Unlike HOGWILD!, concurrent requests to the common model are minimized by a preemptive push-based sharing mechanism that reduces both the number of disk accesses and the cache coherency messages between workers.
Extensive experimental results for three widespread analytics tasks on real and synthetic datasets show that the proposed framework achieves similar convergence to HOGWILD!, while being the only scalable solution for disk-resident models. Another distinct feature of GLADE-ML is hyper-parameter tuning. Identifying the optimal hyper-parameters is a time-consuming process that has to be executed from scratch for every dataset/model combination even by experienced data scientists. GLADE-ML provides speculative parameter testing, which applies advanced parallel multi-query processing methods to evaluate several configurations concurrently. The number of configurations is determined adaptively at runtime, while the configurations themselves are extracted from a distribution that is continuously learned following a Bayesian process. Online aggregation is applied to identify sub-optimal configurations early in the processing by incrementally sampling the training dataset and estimating the objective function corresponding to each configuration.
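The model-parallel access pattern described above can be illustrated with a toy sketch (not GLADE-ML's implementation): the model vector is vertically partitioned, and each SGD update reads and writes only the partitions covering its example's non-zero features, which is what makes the HOGWILD!-style scheme extendable to disk-resident partitions. Partition size, data, and learning rate are made up.

```python
import math
import random

# Illustrative sketch (not GLADE-ML's implementation) of vertically
# partitioning a linear model so each SGD update touches only the
# partitions covering its example's non-zero features -- the access
# pattern behind extending HOGWILD!-style training to disk-resident
# models.

PARTITION = 4                          # features per model partition

def partition_id(feature):
    return feature // PARTITION

def sgd_step(partitions, example, label, lr=0.1):
    """One logistic-regression SGD step over a sparse example
    {feature_index: value}; only the needed partitions are accessed."""
    dot = sum(partitions[partition_id(f)][f % PARTITION] * v
              for f, v in example.items())
    pred = 1.0 / (1.0 + math.exp(-dot))
    grad = pred - label
    for f, v in example.items():       # write back only touched partitions
        partitions[partition_id(f)][f % PARTITION] -= lr * grad * v

num_features = 16
partitions = {p: [0.0] * PARTITION for p in range(num_features // PARTITION)}
random.seed(0)
for _ in range(2000):
    label = random.randint(0, 1)
    example = {0: 1.0} if label else {1: 1.0}   # feature 0 vs feature 1
    example[random.randrange(2, num_features)] = 1.0   # a noise feature
    sgd_step(partitions, example, label)
```

In a disk-resident setting the `partitions` dictionary would map to on-disk pages, and the per-step access set is exactly the set of pages that must be fetched.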
Dot-Product Join: An Array-Relation Join Operator for Big Model Analytics
Big Model analytics tackles the training of massive models that go beyond the
available memory of a single computing device, e.g., CPU or GPU. It generalizes
Big Data analytics which is targeted at how to train memory-resident models
over out-of-memory training data. In this paper, we propose an in-database
solution for Big Model analytics. We identify dot-product as the primary
operation for training generalized linear models and introduce the first
array-relation dot-product join database operator between a set of sparse
arrays and a dense relation. This is a constrained formulation of the
extensively studied sparse matrix vector multiplication (SpMV) kernel. The
paramount challenge in designing the dot-product join operator is how to
optimally schedule access to the dense relation based on the non-contiguous
entries in the sparse arrays. We prove that this problem is NP-hard and propose
a practical solution characterized by two technical contributions -- dynamic
batch processing and array reordering. We devise three heuristics -- LSH,
Radix, and K-center -- for array reordering and analyze them thoroughly. We
execute extensive experiments over synthetic and real data that confirm the
minimal overhead the operator incurs when sufficient memory is available and
the graceful degradation it suffers as memory becomes scarce. Moreover,
dot-product join achieves an order of magnitude reduction in execution time
over alternative in-database solutions.
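The scheduling problem above can be made concrete with a toy sketch: when sparse arrays are processed in batches, each page of the dense relation is read once per batch instead of once per array. The batching below is a crude stand-in for the paper's dynamic batch processing; it does not implement the LSH, Radix, or K-center reordering heuristics, and `PAGE_SIZE` is made up.

```python
# Toy sketch of the dot-product join access pattern: batching sparse
# arrays so each page of the dense relation is fetched once per batch
# rather than once per array. Not the paper's operator or heuristics.

PAGE_SIZE = 4

def pages_needed(array):
    """Dense-relation pages touched by an array's non-zero indexes."""
    return {i // PAGE_SIZE for i in array}

def join_batched(sparse_arrays, dense, batch_size):
    """Dot product of each sparse array {index: value} with the dense
    relation, counting how many page fetches the schedule incurs."""
    results, page_reads = [], 0
    for b in range(0, len(sparse_arrays), batch_size):
        batch = sparse_arrays[b:b + batch_size]
        needed = set().union(*(pages_needed(a) for a in batch))
        pages = {p: dense[p * PAGE_SIZE:(p + 1) * PAGE_SIZE]
                 for p in needed}              # one fetch per page, per batch
        page_reads += len(needed)
        for arr in batch:
            results.append(sum(v * pages[i // PAGE_SIZE][i % PAGE_SIZE]
                               for i, v in arr.items()))
    return results, page_reads

dense = [float(i) for i in range(16)]          # the "model" relation
arrays = [{0: 1.0, 5: 2.0}, {1: 3.0}, {4: 1.0, 9: 1.0}, {8: 2.0}]
one_at_a_time, reads_naive = join_batched(arrays, dense, batch_size=1)
batched, reads_batched = join_batched(arrays, dense, batch_size=2)
```

Both schedules produce identical dot products, but the batched schedule performs fewer page reads because adjacent arrays share pages; the reordering heuristics in the paper exist to make such sharing likely.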